The purpose of this project was to identify clustering patterns in crime data across the city of Chicago. I started by importing all available data and reviewing the temporal distribution of each crime category. Three categories were selected for further analysis: Homicide, Assault, and Motor Vehicle Theft. I performed DBSCAN, HDBSCAN, and k-means clustering on each category to identify density and spatial patterns. I found that the crimes do not appear to be randomly distributed in space. Clusters of different crimes tend to resemble one another, which suggests that socio-economic factors could be affecting the spatial distribution of crime throughout the city.
My first task was to import the data and view the hourly, daily, and monthly distribution of crimes in the city. The figures for the entire data set are included in the appendix. I found it more insightful to plot the hourly distribution as a stacked histogram, which allows the temporal distribution to be compared between crime categories. As one would expect, crime frequency in all categories tended to increase at night.
Mapping the data spatially did not provide much insight initially: there were too many points to identify patterns. A screenshot of the interactive map is included below.
library(stringr)
library(rmarkdown)
library(sf)
library(ggplot2)
library(dplyr)
library(stringi)
library(lubridate)
library(mapview)
library(leafpop)
library(spatstat)
library(dbscan)
library(RColorBrewer)
library(nngeo)
library(tibble)
library(factoextra)
library(png)
library(tidyr)
library(knitr)
knitr::opts_chunk$set(echo=TRUE, include = TRUE, fig.width = 5, fig.height = 3,
fig.align = 'center')
#Import the data and keep only the columns used in the analysis
data=read.csv('CrimesChicago20220225.csv')
data=data[,c('DATE..OF.OCCURRENCE','PRIMARY.DESCRIPTION','BEAT',
             'WARD','LATITUDE','LONGITUDE')]
data=na.omit(data)
#Parse the 12-hour timestamps as POSIXct, which works more reliably in dplyr pipelines than POSIXlt
#Chicago observes Central time
data$DATE..OF.OCCURRENCE=as.POSIXct(data$DATE..OF.OCCURRENCE,
                                    tz='America/Chicago',
                                    format='%m/%d/%Y %I:%M:%S %p')
#Shorten the column names, convert them to title case, and derive the time fields
data=data%>%
  rename(DATE=DATE..OF.OCCURRENCE,DESCRIPTION=PRIMARY.DESCRIPTION)%>%
  rename_with(.fn = str_to_title)%>%
  mutate(Day=wday(Date,label=TRUE),
         Month=month(Date,label=TRUE),
         Hour=hour(Date))
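As a quick check of the timestamp format string used above, the sketch below parses three made-up timestamps and extracts the hour with base R only. The sample values and the `America/Chicago` time zone are assumptions for illustration, and `%p` parsing relies on an English-style AM/PM locale.

```r
# Hypothetical sample timestamps in the month/day/year 12-hour layout
stamps <- c("02/25/2022 12:15:00 AM",
            "02/25/2022 01:30:00 PM",
            "02/25/2022 11:59:59 PM")

# Assumed time zone; the format string matches the one used for the crime data
parsed <- as.POSIXct(stamps, tz = "America/Chicago",
                     format = "%m/%d/%Y %I:%M:%S %p")

# Extract the hour on a 24-hour clock (0-23)
hours <- as.integer(format(parsed, "%H"))
print(hours)
```

Because 12:15 AM maps to hour 0, any hourly axis or filter should cover the full range 0 to 23.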
#Hourly crime distribution
ggplot(data = data, aes(x=Hour))+
  stat_count(fill='lightblue',color='black')+
  labs(title = 'Hourly Crime Distribution')+
  ylab('Frequency')+
  #Hour is numeric (0-23), so a continuous scale is used and midnight (hour 0) is kept
  scale_x_continuous(breaks=0:23)+
  theme_gray()+
  theme(plot.title = element_text(hjust = 0.5, size = 12))
#Daily crime distribution
ggplot(data = data, aes(x=Day))+
  stat_count(fill='lightblue',color='black')+
  labs(title='Daily Crime Distribution')+
  xlab('Day')+
  ylab('Frequency')+
  scale_x_discrete()+
  theme_gray()+
  theme(plot.title = element_text(hjust = 0.5, size = 12))
#Monthly crime distribution
ggplot(data = data, aes(x=Month))+
  stat_count(fill='lightblue',color='black')+
  labs(title='Monthly Crime Distribution')+
  ylab('Frequency')+
  scale_x_discrete()+
  theme_gray()+
  theme(plot.title = element_text(hjust = 0.5, size = 12))
#Crime distribution by category
ggplot(data = data, aes(x=Description))+
  stat_count(fill='lightblue',color='black')+
  labs(title='Crime Distribution by Category')+
  ylab('Frequency')+
  theme_grey()+
  theme(axis.text.y = element_text(size = 7))+
  coord_flip()+
  theme(plot.title = element_text(hjust = 0.5, size = 12))
#Group according to crime "category" and keep the three categories of interest
data_filtered=data%>%
  mutate(Category=case_when(str_detect(Description,'HOMICIDE')~'Homicide',
                            str_detect(Description,'ASSAULT')~'Assault',
                            str_detect(Description,'MOTOR VEHICLE THEFT')~'Vehicle Theft',
                            .default = 'Other' ))%>%
  filter(Category!='Other')
#Hourly crime distribution by category of crime (stacked)
data_filtered%>%
  ggplot(aes(x=Hour,fill=Category))+
  labs(title='Hourly Crime Distribution')+
  ylab('Frequency')+
  stat_count(color='black')+
  #Continuous scale so that midnight (hour 0) is not dropped
  scale_x_continuous(breaks=0:23)+
  theme_grey()+
  theme(plot.title = element_text(hjust = 0.5, size = 12))
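A stacked count plot is dominated by the most frequent category, so it can help to also compare the share of each category that occurs at night. The sketch below does this on a small made-up data frame. The `toy` records and the definition of night (20:00 through 05:59) are assumptions for illustration only, not values from the Chicago data.

```r
# Hypothetical toy records; the Category and Hour columns mirror data_filtered
toy <- data.frame(
  Category = c("Assault", "Assault", "Assault", "Assault",
               "Homicide", "Homicide",
               "Vehicle Theft", "Vehicle Theft", "Vehicle Theft", "Vehicle Theft"),
  Hour = c(2, 14, 22, 23, 1, 15, 3, 9, 21, 12)
)

# Assumed definition of night: 20:00 through 05:59
toy$Night <- toy$Hour >= 20 | toy$Hour < 6

# Row-wise proportions: each category's records split into day (FALSE) and night (TRUE)
night_share <- prop.table(table(toy$Category, toy$Night), margin = 1)
print(round(night_share, 2))
```

Applied to `data_filtered`, the same tabulation would give one night-time share per category, which is easier to compare than bar heights when the categories differ greatly in size.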
#Mapping: convert the filtered records to an sf point layer (WGS 84)
data_sf=st_as_sf(data_filtered,
                 coords = c('Longitude','Latitude'),
                 crs = 'EPSG:4326')
data_sf=data_sf%>%
  filter(!is.na(geometry))
#pattern is a regular expression, so the dot is escaped and the match is anchored to the file extension
shp=list.files(pattern = '\\.shp$')
mapviewOptions(basemaps='OpenStreetMap', fgb = FALSE)
crime_map=mapview(data_sf,
                  zcol='Category',
                  col.regions=brewer.pal(3,'Dark2'),
                  legend=TRUE,
                  popup=popupTable(x=data_sf,
                                   row.numbers=FALSE,
                                   feature.id=FALSE),
                  layer.name='Crime Spatial Distribution')
crime_map